Homework 3: Fake News Classification
This really says a lot about our society.
Ah yes, here’s something that everyone who’s learning machine learning has done: trying to classify fake news based on its contents. It’s a very natural thing to try since you hear about it all the time, and maybe it’s something simple enough that machines can pick out with just amateur-level resources and skills.
Even putting aside its importance to society, a large part of the reason this is such a popular machine learning exercise is how simple it is to frame as a machine learning task. Our predictors (what the machine will be privy to) are of course exactly the information you’d be seeing if you were scrolling through your news articles: the headlines and the body text. And of course, the objective will be 0/1 real/fake labels.
Speaking of which, I should note: the original dataset came from Kaggle, where it had already been labeled (with information from a paper on this topic), but it has been reconfigured and cleaned for us. Luckily, that better version is directly available on GitHub, so you can follow along with this blog post directly if you’d like.
Either way, one thing we’ll be exploring in this tutorial, other than just how to build and run the model, is a very common question around any machine learning task: exactly how much information should the machine know about? It might seem natural that the more information the better, but of course, that is the trap of overfitting. (There’s a nice simple Stack Exchange thread on this topic that explains it quite concisely, and our professor’s lecture notes cover it as well.)
First things first: make sure you have at least TensorFlow 2.4 installed, otherwise you won’t be able to use some of the text vectorization tools we’re going to need.
pip install tensorflow==2.4
We’re going to start off then like any good machine learning tutorial should: importing tensorflow + friends numpy and pandas.
import tensorflow as tf
import numpy as np
import pandas as pd
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow import keras
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.layers.experimental.preprocessing import StringLookup
# for embedding viz
import plotly.express as px
import plotly.io as pio
pio.templates.default = "plotly_white"
Then we can actually import the dataset to see what we’re dealing with.
train_url = "https://github.com/PhilChodrow/PIC16b/blob/master/datasets/fake_news_train.csv?raw=true"
df = pd.read_csv(train_url)[["title","text","fake"]] # only need these columns
df.head() # the other column is just some index
| | title | text | fake |
|---|---|---|---|
| 0 | Merkel: Strong result for Austria's FPO 'big c... | German Chancellor Angela Merkel said on Monday... | 0 |
| 1 | Trump says Pence will lead voter fraud panel | WEST PALM BEACH, Fla.President Donald Trump sa... | 0 |
| 2 | JUST IN: SUSPECTED LEAKER and “Close Confidant... | On December 5, 2017, Circa s Sara Carter warne... | 1 |
| 3 | Thyssenkrupp has offered help to Argentina ove... | Germany s Thyssenkrupp, has offered assistance... | 0 |
| 4 | Trump say appeals court decision on travel ban... | President Donald Trump on Thursday called the ... | 0 |
The data is already pretty nice and clean, but there’s one more step we should take at the outset: removing stop words. These are words like is, the, and at which are really common grammatically but don’t usually contribute to the overall meaning of text.
Of course, we’ll be relying on a list of standard stop words someone else has made, in particular, one from nltk (Natural Language Toolkit). In case you don’t already have this, you’ll have to download their list of stopwords.
import nltk
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\Michael\AppData\Roaming\nltk_data...
[nltk_data] Package stopwords is already up-to-date!
True
from nltk.corpus import stopwords
stop = stopwords.words('english')
And this really is just a Python list of words we can remove.
stop[:10]
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
We know how to work with lists, so removing everything in here from each entry in df["text"] is as easy as using an apply.
clean_text = df["text"].apply(lambda x: " ".join([item for item in x.split() if item not in stop])).to_frame()
(We need that to_frame at the end there to convert the Series (column) resulting from the apply into a fully fledged DataFrame, because this is what TensorFlow wants.)
Essentially, this breaks apart each entire text body into a list of words, throws out ones that are stop words, and stitches it back together.
clean_text.iloc[0].text
'German Chancellor Angela Merkel said Monday strong showing Austria anti-immigrant Freedom Party (FPO) Sunday election big challenge parties. Speaking news conference Berlin, Merkel added hoping close cooperation Austria conservative election winner Sebastian Kurz European level.'
So as you can see, the text is a little less human-readable now, but all the important words are there, and that’s all the computer needs.
Create a (Tensorflow) Dataset
As another step in our data preparation, we’re going to create a TensorFlow Dataset. These are really nice in that they “abstract” away (and you’ll hear that term more and more as you work around specialized software like this) the nitty-gritty of having many different elements in a machine learning pipeline, and simply allow us to keep lots of differently shaped data all in one place that’s easy to reference.
We won’t be using the full power of what these Datasets can do, but the nicest way they help us here is in answering that question we posed at the beginning about which features to include. That naturally means we’re going to want a separate input for the headline titles, and one for the body text. (And then of course, the labels of fakeness as our answer key.)
data = tf.data.Dataset.from_tensor_slices(
    (
        {
            "title" : df[["title"]],
            "text" : clean_text
        },
        {
            "fake" : df[["fake"]]
        }
    )
)
Test-Train Split
Again, as is tradition in machine learning, we’ll split our entire Dataset into training, validation, and testing subsets to more easily control what our model has access to, as well as to have unseen data against which to evaluate it.
This is mainly for the purposes of avoiding overfitting again, but a much more detailed discussion on why this splitting helps can once again be found in Professor Chodrow’s lecture notes.
But the crux is that we’re going to pick a random 70% of the data for training, 10% for validation, and the rest will be testing data.
train_size = int(0.7*len(data))
val_size = int(0.1*len(data))
The Dataset class has really nice built-in functionality for this. To ensure we take random chunks, we can first shuffle the data around:
data = data.shuffle(buffer_size = len(data))
Then finally, we take the first 70% as our training set. To get the next 10% for validation, we want to skip over the first 70% reserved for training, which is easily accomplished with the skip method; the test set then skips over both. Finally, using batch on each helps split them into, well, batches, each of which will be used as we run through several epochs (passes) of training the model.
train = data.take(train_size).batch(20)
val = data.skip(train_size).take(val_size).batch(20)
test = data.skip(train_size + val_size).batch(20)
len(train), len(val), len(test)
(786, 113, 225)
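To see what take, skip, and batch are doing under the hood, here’s the same carving-up sketched on a plain Python list (a toy of 100 elements rather than our real Dataset, and of course tf.data does all this lazily rather than materializing slices):

```python
# the take/skip/batch pattern, mirrored on a plain Python list
data = list(range(100))
train_size, val_size = 70, 10

train = data[:train_size]                        # data.take(train_size)
val   = data[train_size:train_size + val_size]   # data.skip(train_size).take(val_size)
test  = data[train_size + val_size:]             # data.skip(train_size + val_size)

# batch(20) then just groups consecutive elements into chunks of (up to) 20
batches = [train[i:i + 20] for i in range(0, len(train), 20)]
print(len(train), len(val), len(test), len(batches))  # → 70 10 20 4
```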
Neural Network Layers
The main way in which we’ll train our models is with neural networks. These are once again a much bigger topic than I’ll go into here, but allow me to go on a little rant. Most of the people trying to advertise all this machine learning stuff to new learners are always saying things like “oh yes, neural networks/regression/random forests are actually really simple”. But you and I know that’s a blatant lie; do you really think people would devote their lives to machine learning research if regression really were just “fitting a curve”? Most of the time, people are just hiding the scarier details to entice newcomers, and in my opinion this is why so many people get turned off of machine learning nowadays: their expectations for working with TensorFlow and the like are built up really high, when there really isn’t a button that just says “do regression on my data, please”.
All this to say: I’m acknowledging I’m not going to do all the machine learning explanations here much justice, but I’ll present the headlines for each topic.
So as far as this tutorial is concerned: a neural network consists of a bunch of “layers” (calculations with weights) that the data is pushed through, and we’ll let TensorFlow take care of the rest. The “machine learning” just comes in when we readjust those weights to minimize the difference between our predictions (what we have at the last layer) and the true labels we know externally.
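That “readjust the weights to minimize the difference” idea can be sketched in a few lines of plain Python, with a single weight and a squared-error loss (a toy illustration of gradient descent, not what TensorFlow literally does internally):

```python
# one weight w, trying to make w * x match y_true
w = 0.0
x, y_true = 2.0, 6.0   # the "right" answer here is w = 3
lr = 0.05              # learning rate: how big a step to take each time

for _ in range(200):
    y_pred = w * x
    grad = 2 * (y_pred - y_true) * x  # derivative of (y_pred - y_true)**2 w.r.t. w
    w -= lr * grad                    # step downhill on the loss

print(round(w, 3))  # → 3.0
```

TensorFlow does the same thing, just with millions of weights at once and the derivatives computed automatically.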
And layers aren’t all the same either; they can have their own specialized roles to play in how the data is processed, affording us lots of flexibility in tuning modular parts of the whole model. Of course, a model is only as good as the layers we put together for it, so we’ll want to make sure we give them enough power to process our text documents well. The critical layers we’ll be using in our models are vectorization and embedding.
Text Processing
Before we actually talk about vectorization, we need to create a standardization function, to make sure our text doesn’t still have a lot of meaningless noise lying around after removing stop words that could confuse the computer. While removing stop words took care of meaningless vocabulary and grammar, there still might be differences in punctuation and capitalization (for instance, you may write “Machine Learning” while I say “machine learning”, but we still want to count those the same).
Let’s make a nice function that does this for us.
import re
import string
def standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    no_punctuation = tf.strings.regex_replace(lowercase,
                                              '[%s]' % re.escape(string.punctuation), '')
    return no_punctuation
For example,
standardization("Hoppin' out the Wraith, esskeetit").numpy()
b'hoppin out the wraith esskeetit'
Now we’re ready for the actual vectorization. As I mentioned in the previous blog post, we like to put our data into formats computers/math like to work with, like vectors or matrices. One way of doing this is to rank every important word by how frequently it appears in our data. For example, one “datapoint” of text:
Aliens Sighted ~in~ LA
could be transformed into
[1900 900 500]
where “aliens” is the 1900th most common word, “sighted” is the 900th, and “LA” is the 500th (“in” is probably a stop word).
And if you think about it, this is actually a really nice way to condense information about what a word means to humans into a format computers like. For example, if “aliens” is the 1900th word in frequency out of 2000 and is hardly mentioned in news that we’re pretty sure is real, but then all of a sudden it’s the 10th most common word in another article, that document would be rather sus.
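To make that concrete, here’s the rank-encoding idea in miniature (the ranks are made up for illustration, and real vectorizers also handle out-of-vocabulary words, padding, and so on):

```python
# toy frequency-rank vocabulary: word -> its rank by frequency in the corpus
rank = {"aliens": 1900, "sighted": 900, "la": 500}

def encode(text):
    # words not in the vocabulary (like the stop word "in") are simply dropped here
    return [rank[w] for w in text.lower().split() if w in rank]

print(encode("Aliens Sighted in LA"))  # → [1900, 900, 500]
```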
It’s nice to know about how the data is being transformed, but all the technical details will be taken care of by that TensorFlow TextVectorization layer we imported earlier. You can read its documentation, but these are the arguments we need to provide:
# only the top distinct words will be tracked
max_tokens = 2000
# each headline will be a vector of length 25
sequence_length = 25
vectorize_layer = TextVectorization(
    standardize=standardization,
    max_tokens=max_tokens, # only consider this many words
    output_mode='int',
    output_sequence_length=sequence_length)
The main highlights are that we need to provide a max_tokens, which indicates how many words we’ll have in our “vocabulary” (really rare words like Thyssenkrupp probably won’t help our model learn too much), the standardize parameter that takes in the standardization function we built earlier, and the output mode of int to indicate we wanted those numeric rankings.
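To get a feel for what output_mode='int' plus output_sequence_length produce together, here’s a plain-Python sketch (the token ids are invented; by convention 0 is padding and 1 marks an out-of-vocabulary word, which matches TextVectorization’s defaults):

```python
vocab_rank = {"trump": 3, "says": 12, "panel": 150}   # made-up token ids

def vectorize(text, length=25):
    ids = [vocab_rank.get(w, 1) for w in text.lower().split()]  # 1 = [UNK]
    ids = ids[:length]                       # truncate anything too long
    return ids + [0] * (length - len(ids))   # pad short texts with 0s

v = vectorize("Trump says thyssenkrupp panel", length=6)
print(v)  # → [3, 12, 1, 150, 0, 0]
```

So every headline comes out as a fixed-length vector of integers, which is exactly the shape the later layers expect.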
And then getting the vectorization layer to calculate all those overall rankings for our data is as simple as using its adapt function on our input text.
(For whoever’s giving feedback): I still need to figure out if I should change the vectorization layer when changing the input data, i.e. having it adapt to headlines, title, maybe even both. I’m not sure it’d be worth the time for me to do what a real person would probably do, and see which one does the best for each…
headlines = train.map(lambda x, y: x["title"])
vectorize_layer.adapt(headlines)
(Remember what I was saying earlier about how layers in neural networks can have their own special purpose? This vectorization layer, which adapts itself to our data before training even starts, is a perfect example.)
Finally, we’ll create keras.Input objects to distinguish between places where we input title (headline) data, and text (body) data. Even though they’re both string data, it’s nice that we have two different “categorizations” of the data, because these are fundamentally different objects, where words that appear in one place might not mean the same as the same words appearing in the other place.
title_input = keras.Input(
    shape = (1,),
    name = "title",
    dtype = "string"
)

text_input = keras.Input(
    shape = (1,),
    name = "text",
    dtype = "string"
)
Modeling
Now we can finally start putting together the models, which will be constructed by stacking many layers: instantiating lots of different layers and repeatedly calling them on our input. We’ll make this into a function, because we’ll be using the same overall layer architecture for all the model experiments we’ll end up creating.
Note that we can actually name layers for easy reference later. In fact, it’s important that we name the last output layer the same as the name of our labels, fake, so that TensorFlow knows those really are the final outputs.
Only Headlines
def create_layers(input_data, embedding_layer=None):
    # remember whether this model stands alone (no shared embedding passed in)
    standalone = embedding_layer is None
    if standalone:
        embedding_layer = layers.Embedding(max_tokens, 10, name = "embedding")
    features = vectorize_layer(input_data)
    features = embedding_layer(features)
    features = layers.Dropout(0.2)(features)
    features = layers.GlobalAveragePooling1D()(features)
    features = layers.Dropout(0.2)(features)
    features = layers.Dense(32, activation='relu')(features)
    # only name the output "fake" for standalone models; when we pass in a
    # shared embedding, this output gets fed into further layers instead
    output = layers.Dense(2, name = "fake" if standalone else None)(features)
    return output
A couple numbers to note here: the Embedding layer has 10 dimensions, and the final Dense layer has 2, the number of “classifications” for each article (fake or real).
In case you’re wondering what those Dropout and Dense layers mean: they’re less important than the ones I highlighted already, but the headlines are that
- Dense is a kind of layer they call “densely connected”, where every input node is connected to every output node; in other words, every input affects every part of the next layer.
- Dropout is a technique where randomly selected neurons are “dropped” during training. This means their contribution to the activation of later layers is temporarily removed on the forward pass, and weight updates are not applied to those neurons. This might sound like a bad thing, but the overall reward is that the network becomes less sensitive to the specific weights of individual nodes, i.e., less prone to overfitting the training data.
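As a rough numerical sketch of those two layers, using numpy rather than TensorFlow (the sizes and the 0.2 rate mirror our model, but the weights here are just random):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 10))   # a batch of 4 inputs, 10 features each

# Dense(32): every input feature connects to every one of the 32 output units
W = rng.normal(size=(10, 32))
b = np.zeros(32)
dense_out = np.maximum(x @ W + b, 0)   # relu activation

# Dropout(0.2): zero out ~20% of activations at random during training,
# rescaling the survivors so the expected value is unchanged
keep = rng.random(dense_out.shape) > 0.2
dropped = np.where(keep, dense_out / 0.8, 0.0)

print(dense_out.shape, dropped.shape)  # both (4, 32)
```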
Either way, now that we have the output as its own variable after stacking several layers before it, we can feed it to a keras class that will scaffold a fully fledged model around our specified layers.
only_title_model = keras.Model(
    inputs = [title_input],
    outputs = create_layers(title_input)
)
Then we have to “compile” the model, turning all the layers into a single “machine” that TensorFlow can feed and tune to help it learn. The important part here is that we also specify an optimizer + loss function to perform the learning. (Those concepts are again explained in the PIC16A lecture notes.)
only_title_model.compile(optimizer = "adam",
                         loss = losses.SparseCategoricalCrossentropy(from_logits=True),
                         metrics=['accuracy']
)
Then finally, we can perform the “learning”, or fitting. We feed in the training data, tell it how many epochs (passes) to run, and hand it the validation data to check against.
history = only_title_model.fit(train, epochs = 20, validation_data = val)
Epoch 1/20
D:\anaconda3\envs\PIC16B\lib\site-packages\tensorflow\python\keras\engine\functional.py:595: UserWarning: Input dict contained keys ['text'] which did not match any model input. They will be ignored by the model.
[n for n in tensors.keys() if n not in ref_input_names])
786/786 [==============================] - 3s 2ms/step - loss: 0.4934 - accuracy: 0.8084 - val_loss: 0.1311 - val_accuracy: 0.9559
Epoch 2/20
786/786 [==============================] - 1s 2ms/step - loss: 0.1435 - accuracy: 0.9469 - val_loss: 0.0976 - val_accuracy: 0.9666
Epoch 3/20
786/786 [==============================] - 1s 2ms/step - loss: 0.1083 - accuracy: 0.9624 - val_loss: 0.0965 - val_accuracy: 0.9688
Epoch 4/20
786/786 [==============================] - 1s 2ms/step - loss: 0.0962 - accuracy: 0.9661 - val_loss: 0.0818 - val_accuracy: 0.9733
Epoch 5/20
786/786 [==============================] - 1s 2ms/step - loss: 0.0806 - accuracy: 0.9705 - val_loss: 0.0598 - val_accuracy: 0.9791
Epoch 6/20
786/786 [==============================] - 1s 2ms/step - loss: 0.0802 - accuracy: 0.9717 - val_loss: 0.0783 - val_accuracy: 0.9746
Epoch 7/20
786/786 [==============================] - 1s 2ms/step - loss: 0.0811 - accuracy: 0.9707 - val_loss: 0.0721 - val_accuracy: 0.9701
Epoch 8/20
786/786 [==============================] - 1s 2ms/step - loss: 0.0757 - accuracy: 0.9722 - val_loss: 0.0674 - val_accuracy: 0.9786
Epoch 9/20
786/786 [==============================] - 1s 2ms/step - loss: 0.0753 - accuracy: 0.9747 - val_loss: 0.0619 - val_accuracy: 0.9759
Epoch 10/20
786/786 [==============================] - 1s 2ms/step - loss: 0.0730 - accuracy: 0.9744 - val_loss: 0.0649 - val_accuracy: 0.9742
Epoch 11/20
786/786 [==============================] - 1s 2ms/step - loss: 0.0708 - accuracy: 0.9748 - val_loss: 0.0648 - val_accuracy: 0.9759
Epoch 12/20
786/786 [==============================] - 1s 2ms/step - loss: 0.0744 - accuracy: 0.9728 - val_loss: 0.0634 - val_accuracy: 0.9768
Epoch 13/20
786/786 [==============================] - 2s 2ms/step - loss: 0.0639 - accuracy: 0.9772 - val_loss: 0.0641 - val_accuracy: 0.9764
Epoch 14/20
786/786 [==============================] - 2s 2ms/step - loss: 0.0739 - accuracy: 0.9750 - val_loss: 0.0630 - val_accuracy: 0.9782
Epoch 15/20
786/786 [==============================] - 1s 2ms/step - loss: 0.0664 - accuracy: 0.9765 - val_loss: 0.0577 - val_accuracy: 0.9773
Epoch 16/20
786/786 [==============================] - 1s 2ms/step - loss: 0.0615 - accuracy: 0.9782 - val_loss: 0.0368 - val_accuracy: 0.9862
Epoch 17/20
786/786 [==============================] - 1s 2ms/step - loss: 0.0582 - accuracy: 0.9795 - val_loss: 0.0483 - val_accuracy: 0.9840
Epoch 18/20
786/786 [==============================] - 1s 2ms/step - loss: 0.0672 - accuracy: 0.9752 - val_loss: 0.0643 - val_accuracy: 0.9768
Epoch 19/20
786/786 [==============================] - 1s 2ms/step - loss: 0.0605 - accuracy: 0.9790 - val_loss: 0.0515 - val_accuracy: 0.9835
Epoch 20/20
786/786 [==============================] - 2s 2ms/step - loss: 0.0620 - accuracy: 0.9774 - val_loss: 0.0564 - val_accuracy: 0.9808
from matplotlib import pyplot as plt
plt.plot(history.history["accuracy"], label = "training")
plt.plot(history.history["val_accuracy"], label = "validation")
plt.gca().set(xlabel = "epoch", ylabel = "accuracy")
plt.legend()
<matplotlib.legend.Legend at 0x1bf9a722f88>

Predictions on Unseen Data
With all the abstraction, we can now just ask TensorFlow to apply what our model learned to the unseen testing data with a single “button press”.
only_title_model.evaluate(test)
225/225 [==============================] - 0s 1ms/step - loss: 0.0483 - accuracy: 0.9846
[0.04828381538391113, 0.9846359491348267]
Only Text
Now, we’ll perform the exact same process, this time with only the body text. Thanks to the create_layers function we wrote, there’s no need to stack all the layers by hand again:
only_text_model = keras.Model(
    inputs = [text_input],
    outputs = create_layers(text_input)
)
only_text_model.compile(optimizer = "adam",
                        loss = losses.SparseCategoricalCrossentropy(from_logits=True),
                        metrics=['accuracy']
)
history = only_text_model.fit(train, epochs = 20, validation_data = val)
Epoch 1/20
D:\anaconda3\envs\PIC16B\lib\site-packages\tensorflow\python\keras\engine\functional.py:595: UserWarning: Input dict contained keys ['title'] which did not match any model input. They will be ignored by the model.
[n for n in tensors.keys() if n not in ref_input_names])
786/786 [==============================] - 5s 4ms/step - loss: 0.4692 - accuracy: 0.7589 - val_loss: 0.1324 - val_accuracy: 0.9496
Epoch 2/20
786/786 [==============================] - 3s 4ms/step - loss: 0.1344 - accuracy: 0.9479 - val_loss: 0.1053 - val_accuracy: 0.9563
Epoch 3/20
786/786 [==============================] - 3s 4ms/step - loss: 0.1138 - accuracy: 0.9560 - val_loss: 0.1007 - val_accuracy: 0.9657
Epoch 4/20
786/786 [==============================] - 3s 3ms/step - loss: 0.1050 - accuracy: 0.9600 - val_loss: 0.0926 - val_accuracy: 0.9666
Epoch 5/20
786/786 [==============================] - 3s 4ms/step - loss: 0.0999 - accuracy: 0.9624 - val_loss: 0.0851 - val_accuracy: 0.9701
Epoch 6/20
786/786 [==============================] - 3s 4ms/step - loss: 0.1033 - accuracy: 0.9624 - val_loss: 0.0795 - val_accuracy: 0.9719
Epoch 7/20
786/786 [==============================] - 3s 4ms/step - loss: 0.0934 - accuracy: 0.9657 - val_loss: 0.0819 - val_accuracy: 0.9719
Epoch 8/20
786/786 [==============================] - 3s 4ms/step - loss: 0.0896 - accuracy: 0.9679 - val_loss: 0.0825 - val_accuracy: 0.9706
Epoch 9/20
786/786 [==============================] - 3s 4ms/step - loss: 0.0863 - accuracy: 0.9673 - val_loss: 0.0664 - val_accuracy: 0.9786
Epoch 10/20
786/786 [==============================] - 4s 4ms/step - loss: 0.0974 - accuracy: 0.9639 - val_loss: 0.0630 - val_accuracy: 0.9759
Epoch 11/20
786/786 [==============================] - 4s 4ms/step - loss: 0.0847 - accuracy: 0.9690 - val_loss: 0.0840 - val_accuracy: 0.9715
Epoch 12/20
786/786 [==============================] - 3s 4ms/step - loss: 0.0854 - accuracy: 0.9685 - val_loss: 0.0661 - val_accuracy: 0.9724
Epoch 13/20
786/786 [==============================] - 3s 4ms/step - loss: 0.0858 - accuracy: 0.9691 - val_loss: 0.0692 - val_accuracy: 0.9791
Epoch 14/20
786/786 [==============================] - 3s 4ms/step - loss: 0.0841 - accuracy: 0.9717 - val_loss: 0.0824 - val_accuracy: 0.9706
Epoch 15/20
786/786 [==============================] - 4s 5ms/step - loss: 0.0854 - accuracy: 0.9704 - val_loss: 0.0817 - val_accuracy: 0.9742
Epoch 16/20
786/786 [==============================] - 3s 4ms/step - loss: 0.0878 - accuracy: 0.9678 - val_loss: 0.0641 - val_accuracy: 0.9791
Epoch 17/20
786/786 [==============================] - 3s 4ms/step - loss: 0.0843 - accuracy: 0.9685 - val_loss: 0.0699 - val_accuracy: 0.9791
Epoch 18/20
786/786 [==============================] - 3s 4ms/step - loss: 0.0829 - accuracy: 0.9705 - val_loss: 0.0581 - val_accuracy: 0.9804
Epoch 19/20
786/786 [==============================] - 3s 4ms/step - loss: 0.0814 - accuracy: 0.9701 - val_loss: 0.0836 - val_accuracy: 0.9715
Epoch 20/20
786/786 [==============================] - 3s 4ms/step - loss: 0.0847 - accuracy: 0.9687 - val_loss: 0.0649 - val_accuracy: 0.9813
from matplotlib import pyplot as plt
plt.plot(history.history["accuracy"], label = "training")
plt.plot(history.history["val_accuracy"], label = "validation")
plt.gca().set(xlabel = "epoch", ylabel = "accuracy")
plt.legend()
<matplotlib.legend.Legend at 0x1bf9aec71c8>

only_text_model.evaluate(test)
225/225 [==============================] - 1s 3ms/step - loss: 0.0669 - accuracy: 0.9768
[0.06688765436410904, 0.9768425822257996]
Both Text and Title
shared_embedding = layers.Embedding(max_tokens, 10, name = "embedding")
title_features = create_layers(title_input, embedding_layer=shared_embedding)
text_features = create_layers(text_input, embedding_layer=shared_embedding)
main = layers.concatenate([title_features, text_features], axis = 1)
main = layers.Dense(32, activation='relu')(main)
output = layers.Dense(2, name = "fake")(main)
title_and_text_model = keras.Model(
    inputs = [title_input, text_input],
    outputs = output
)
title_and_text_model.compile(optimizer = "adam",
                             loss = losses.SparseCategoricalCrossentropy(from_logits=True),
                             metrics=['accuracy']
)
history = title_and_text_model.fit(train, epochs = 20, validation_data = val)
Epoch 1/20
786/786 [==============================] - 5s 4ms/step - loss: 0.4035 - accuracy: 0.8321 - val_loss: 0.1233 - val_accuracy: 0.9505
Epoch 2/20
786/786 [==============================] - 3s 3ms/step - loss: 0.1206 - accuracy: 0.9594 - val_loss: 0.0814 - val_accuracy: 0.9693
Epoch 3/20
786/786 [==============================] - 3s 3ms/step - loss: 0.0973 - accuracy: 0.9646 - val_loss: 0.0908 - val_accuracy: 0.9652
Epoch 4/20
786/786 [==============================] - 3s 3ms/step - loss: 0.0934 - accuracy: 0.9684 - val_loss: 0.0829 - val_accuracy: 0.9706
Epoch 5/20
786/786 [==============================] - 2s 3ms/step - loss: 0.0848 - accuracy: 0.9696 - val_loss: 0.0618 - val_accuracy: 0.9822
Epoch 6/20
786/786 [==============================] - 2s 3ms/step - loss: 0.0775 - accuracy: 0.9727 - val_loss: 0.0586 - val_accuracy: 0.9808
Epoch 7/20
786/786 [==============================] - 2s 2ms/step - loss: 0.0736 - accuracy: 0.9745 - val_loss: 0.0678 - val_accuracy: 0.9768
Epoch 8/20
786/786 [==============================] - 2s 3ms/step - loss: 0.0719 - accuracy: 0.9732 - val_loss: 0.0624 - val_accuracy: 0.9782
Epoch 9/20
786/786 [==============================] - 2s 3ms/step - loss: 0.0641 - accuracy: 0.9786 - val_loss: 0.0599 - val_accuracy: 0.9822
Epoch 10/20
786/786 [==============================] - 2s 3ms/step - loss: 0.0677 - accuracy: 0.9753 - val_loss: 0.0521 - val_accuracy: 0.9822
Epoch 11/20
786/786 [==============================] - 3s 3ms/step - loss: 0.0671 - accuracy: 0.9759 - val_loss: 0.0571 - val_accuracy: 0.9795
Epoch 12/20
786/786 [==============================] - 2s 3ms/step - loss: 0.0660 - accuracy: 0.9750 - val_loss: 0.0561 - val_accuracy: 0.9795
Epoch 13/20
786/786 [==============================] - 3s 4ms/step - loss: 0.0610 - accuracy: 0.9782 - val_loss: 0.0564 - val_accuracy: 0.9817
Epoch 14/20
786/786 [==============================] - 3s 4ms/step - loss: 0.0668 - accuracy: 0.9778 - val_loss: 0.0548 - val_accuracy: 0.9817
Epoch 15/20
786/786 [==============================] - 2s 2ms/step - loss: 0.0598 - accuracy: 0.9797 - val_loss: 0.0554 - val_accuracy: 0.9799
Epoch 16/20
786/786 [==============================] - 3s 3ms/step - loss: 0.0666 - accuracy: 0.9763 - val_loss: 0.0470 - val_accuracy: 0.9840
Epoch 17/20
786/786 [==============================] - 3s 3ms/step - loss: 0.0552 - accuracy: 0.9814 - val_loss: 0.0464 - val_accuracy: 0.9853
Epoch 18/20
786/786 [==============================] - 2s 3ms/step - loss: 0.0644 - accuracy: 0.9776 - val_loss: 0.0502 - val_accuracy: 0.9826
Epoch 19/20
786/786 [==============================] - 2s 3ms/step - loss: 0.0613 - accuracy: 0.9794 - val_loss: 0.0510 - val_accuracy: 0.9826
Epoch 20/20
786/786 [==============================] - 3s 3ms/step - loss: 0.0614 - accuracy: 0.9795 - val_loss: 0.0446 - val_accuracy: 0.9875
from matplotlib import pyplot as plt
plt.plot(history.history["accuracy"], label = "training")
plt.plot(history.history["val_accuracy"], label = "validation")
plt.gca().set(xlabel = "epoch", ylabel = "accuracy")
plt.legend()
<matplotlib.legend.Legend at 0x1bf9b1e4608>

title_and_text_model.evaluate(test)
225/225 [==============================] - 1s 2ms/step - loss: 0.0523 - accuracy: 0.9840
[0.052267417311668396, 0.9839679598808289]
I suppose one warning you hear quite a bit when you’re first studying machine learning is that “too much information can actually cause more harm than good”.
But in this case, we do have amazing accuracy on our unseen data from the model that has the most information possible. One idea I’d throw out there as to why: the different “feel” of the headline in a faker, more clickbait-y article is dramatized and intentionally made more inflammatory, and maybe the model can only pick up on that contrast when it sees the headline and the body together.
Embeddings
Remember that embedding layer I said was important but we’d save talking about for later? Now is later.
Let’s just start by poking around the embedding layer at the end of model fitting. This is where that naming of layers comes in: we can just ask for the layer by name and open up the hood a bit to see the numbers inside.
weights = title_and_text_model.get_layer('embedding').get_weights()[0] # get the weights from the embedding layer
vocab = vectorize_layer.get_vocabulary() # get the vocabulary from our data prep for later
weights
array([[ 0.01516789, -0.02696777, 0.03883211, ..., 0.02343773,
0.01760186, 0.01586471],
[-0.0337828 , 0.03754019, -0.0383072 , ..., -0.0289585 ,
-0.02819344, -0.04658148],
[-0.00865335, 0.02707106, -0.10757445, ..., 0.03221235,
0.00225151, 0.01295694],
...,
[-0.41807944, 0.38946232, -0.40882814, ..., -0.5125998 ,
-0.4375395 , -0.43943298],
[-0.3980184 , 0.5073266 , -0.5625422 , ..., -0.38761407,
-0.5029228 , -0.37710074],
[-0.46357223, 0.48129812, -0.5456448 , ..., -0.48393884,
-0.3351998 , -0.4587918 ]], dtype=float32)
weights.shape
(2000, 10)
See? I really wasn’t kidding when I said each layer really is just a bunch of (meaningful!) numbers. In fact, they’re so meaningful that we can actually gain some really nice insights into the associations between words that our model built for itself.
That’s right: we’re getting into what an embedding really is. This is the layer of weights where the neural network was learning which words are closely associated with each other. Imagine each word (in our 2000-word vocabulary) as a point in 10-dimensional space, with points physically closer together being more related.

Image credit: Towards Data Science
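To make “closer means more related” concrete, here’s the idea with made-up 3-dimensional vectors standing in for our 10-dimensional embeddings (the words and numbers below are purely illustrative, not taken from our trained model):

```python
import numpy as np

# hypothetical embedding vectors
emb = {
    "trump":   np.array([ 0.9,  0.1,  0.3]),
    "hillary": np.array([ 0.8,  0.2,  0.4]),
    "weather": np.array([-0.6,  0.9, -0.5]),
}

def dist(a, b):
    # plain Euclidean distance between two word vectors
    return float(np.linalg.norm(emb[a] - emb[b]))

# two political words should sit closer together than a political word
# and an unrelated one
print(dist("trump", "hillary") < dist("trump", "weather"))  # → True
```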
We can even visualize this a bit ourselves, since we have all those numbers lying right there in weights. We humans of course have trouble visualizing the very high number of dimensions often involved in data and machine learning, so the standard trick is to use PCA to reduce the number of dimensions to something more reasonable.
Note: I remember always hearing the term “dimension reduction” in machine learning circles, and being intimidated by it. I’m not going to pull the “but it’s a really simple concept” again, so instead I’ll just point you to this now somewhat famous PCA tutorial video that steps through the process at a nice learner’s pace.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
weights = pca.fit_transform(weights)
Now we’ll make a data frame from our results:
embedding_df = pd.DataFrame({
    'word' : vocab,
    'x0' : weights[:,0],
    'x1' : weights[:,1]
})
embedding_df
| | word | x0 | x1 |
|---|---|---|---|
| 0 | -0.151776 | -0.027071 | |
| 1 | [UNK] | 0.046081 | -0.015569 |
| 2 | to | -0.034870 | 0.088164 |
| 3 | trump | -0.028481 | -0.045172 |
| 4 | in | -0.059436 | -0.018053 |
| ... | ... | ... | ... |
| 1995 | 14 | 0.415547 | -0.119405 |
| 1996 | “it’s | 0.978748 | 0.099301 |
| 1997 | “hillary | 1.296868 | -0.076418 |
| 1998 | “he | 1.399378 | 0.102498 |
| 1999 | “black | 1.379671 | 0.037937 |
2000 rows × 3 columns
Ready to plot! Hover around the cloud and see if any of the words clustered together surprise you.
import plotly.express as px
fig = px.scatter(embedding_df,
                 x = "x0",
                 y = "x1",
                 size = list(np.ones(len(embedding_df))),
                 size_max = 2,
                 hover_name = "word")
fig.show()
Isn’t that neat? The part I’m most impressed by, of course, is that a machine could learn all this on its own (learning a lot about how humans speak even though we didn’t give it a dictionary or tell it how language works or anything), and that it could make neat associations between words that were close in meaning, or even just the subjects of closely related news.
For example, something interesting we can take a look at is where certain categories of words are physically located in this cloud.
foreign = ["vietnam", "phillipines", "chinas", "syria", "beijing", "brazil"]
domestic = ["us", "usa", "america", "american", "states", "hillary", "trump"]

def highlight_mapper(x):
    if x in foreign:
        return 1
    elif x in domestic:
        return 4
    else:
        return 0

embedding_df["highlight"] = embedding_df["word"].apply(highlight_mapper)
embedding_df["size"] = np.array(1.0 + 50*(embedding_df["highlight"] > 0))
import plotly.express as px
fig = px.scatter(embedding_df,
                 x = "x0",
                 y = "x1",
                 color = "highlight",
                 size = list(embedding_df["size"]),
                 size_max = 10,
                 hover_name = "word")
fig.show()
Interestingly, the one purple (foreign) word on the right hand side of the cloud is “Syria”. Perhaps this speaks a bit to the different nature of articles having to do with Syria (lots of foreign participants in war, instability) rather than the more standard foreign politics discussed with the other nations (trade/standard news there).
TODO: what the heck does this mean?? There are some oranges close to yellow, but it’s not too clear. Is there a better question to ask??